Method and system for parsing navigation information

ABSTRACT

A method and system for providing customers with access to and analysis of the navigation data collected at customer web sites is provided. In one embodiment, a data warehouse system collects customer data from the customer web sites and stores the data at a data warehouse server. The customer data may include navigation data (e.g., click stream log files), user attribute data of users of the customer web site (e.g., name, age, and gender), product data (e.g., catalog of products offered for sale by the customer), shopping cart data (i.e., identification of the products currently in a user's shopping cart), and so on. When the data warehouse server receives customer data, it converts the customer data into a format that is more conducive to processing by decision support system applications by which customers can analyze their data. For example, the data warehouse server may analyze low-level navigation events (e.g., each HTTP request that is received by the customer web site) to identify high-level events (e.g., a user session). The data warehouse server then stores the converted data into a data warehouse.

TECHNICAL FIELD

[0001] The described technology relates to the analysis of data relating to events generated by a computer program.

BACKGROUND

[0002] Today's computer networking environments, such as the Internet, offer mechanisms for delivering documents between heterogeneous computer systems. One such network, the World Wide Web network, which comprises a subset of Internet sites, supports a standard protocol for requesting and receiving documents known as web pages. This protocol is known as the Hypertext Transfer Protocol, or “HTTP.” HTTP defines a message passing protocol for sending and receiving packets of information between diverse applications. Details of HTTP can be found in various documents including T. Berners-Lee et al., Hypertext Transfer Protocol -- HTTP/1.0, Request for Comments (RFC) 1945, MIT/LCS, May 1996. Each HTTP message follows a specific layout, which includes, among other information, a header that contains information specific to the request or response. Further, each HTTP request message contains a universal resource identifier (a “URI”), which specifies to which network resource the request is to be applied. A URI is either a Uniform Resource Locator (“URL”) or Uniform Resource Name (“URN”), or any other formatted string that identifies a network resource. The URI contained in a request message, in effect, identifies the destination machine for a message. URLs, as an example of URIs, are discussed in detail in T. Berners-Lee, et al., Uniform Resource Locators (URL), RFC 1738, CERN, Xerox PARC, Univ. of Minn., December 1994.

[0003] FIG. 1 illustrates how a browser application enables users to navigate among nodes on the web network by requesting and receiving web pages. For the purposes of this discussion, a web page is any type of document that abides by the HTML format. That is, the document includes an “<HTML>” statement. Thus, a web page is also referred to as an HTML document. The HTML format is a document mark-up language, defined by the Hypertext Markup Language (“HTML”) specification. HTML defines tags for specifying how to interpret the text and images stored in an HTML document. For example, there are HTML tags for defining paragraph formats and for emboldening and underlining text. In addition, the HTML format defines tags for adding images to documents and for formatting and aligning text with respect to images. HTML tags appear between angle brackets, for example, <HTML>. Further details of HTML are discussed in T. Berners-Lee and D. Connolly, Hypertext Markup Language - 2.0, RFC 1866, MIT/W3C, November 1995.

[0004] In FIG. 1, a web browser application 101 is shown executing on a client computer 102, which communicates with a server computer 103 by sending and receiving HTTP packets (messages). HTTP messages may also be generated by other types of computer programs, such as spiders and crawlers. The web browser “navigates” to new locations on the network to browse (display) what is available at these locations. In particular, when the web browser “navigates” to a new location, it requests a new document from the new location (e.g., the server computer) by sending an HTTP-request message 104 using any well-known underlying communications wire protocol. The HTTP-request message follows the specific layout discussed above, which includes a header 105 and a URI field 106, which specifies the network location to which to apply the request. When the server computer specified by the URI receives the HTTP-request message, it interprets the message packet and sends a return message packet to the source location that originated the message in the form of an HTTP-response message 107. It also stores a copy of the request and basic information about the requesting computer in a log file. In addition to the standard features of an HTTP message, such as the header 108, the HTTP-response message contains the requested HTML document 109. When the HTTP-response message reaches the client computer, the web browser application extracts the HTML document from the message, parses and interprets (executes) the HTML code in the document, and displays the document on a display screen of the client computer as specified by the HTML tags. HTTP can also be used to transfer other media types, such as the Extensible Markup Language (“XML”) and graphics interchange format (“GIF”) formats.

[0005] The World Wide Web is especially conducive to conducting electronic commerce (“e-commerce”). E-commerce generally refers to commercial transactions that are at least partially conducted using the World Wide Web. For example, numerous web sites are available through which a user using a web browser can purchase items, such as books, groceries, and software. A user of these web sites can browse through an electronic catalog of available items to select the items to be purchased. To purchase the items, a user typically adds the items to an electronic shopping cart and then electronically pays for the items that are in the shopping cart. The purchased items can then be delivered to the user via conventional distribution channels (e.g., an overnight courier) or via electronic delivery when, for example, software is being purchased. Many web sites are also informational in nature, rather than commercial in nature. For example, many standards organizations and governmental organizations have web sites with a primary purpose of distributing information. Also, some web sites (e.g., a search engine) provide information and derive revenue from advertisements that are displayed.

[0006] The success of any web-based business depends in large part on the number of users who visit the business's web site, and that number depends in large part on the usefulness and ease-of-use of the web site. Web sites typically collect extensive information on how their users use the site's web pages. This information may include a complete history of each HTTP request received by and each HTTP response sent by the web site. The web site may store this information in a navigation file, also referred to as a log file or click stream file. By analyzing this navigation information, a web site operator may be able to identify trends in the access of the web pages and modify the web site to make it easier to use and more useful. Because the information is presented as a series of events that are not sorted in a useful way, many software tools are available to assist in this analysis. A web site operator would typically purchase such a tool and install it on one of the computers of the web site. There are several drawbacks to this approach of analyzing navigation information. First, the analysis often is given a low priority because the programmers are typically busy with the high priority task of maintaining the web site. Second, the tools that are available provide little more than standard reports relating to low-level navigation through a web site. Such reports are not very useful in helping a web site operator to visualize and discover high-level access trends. Recognition of these high-level access trends can help a web site operator to design the web site. Third, web sites are typically resource intensive, that is, they use a lot of computing resources and may not have available resources to effectively analyze the navigation information.

[0007] It would also be useful to analyze the execution of computer programs other than web server programs. In particular, many types of computer programs generate events that are logged by the computer programs themselves or by other programs that receive the events. If a computer program does not generate explicit events, another program may be able to monitor the execution and generate events on behalf of that computer program. Regardless of how event data is collected, it may be important to analyze that data. For example, the developer of an operating system may want to track and analyze how the operating system is used so that the developer can focus resources on problems that are detected, optimize services that are frequently accessed, and so on. The operating system may generate a log file that contains entries for various types of events (e.g., invocation of a certain system call).

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] FIG. 1 illustrates how a browser application enables users to navigate among nodes on the web network by requesting and receiving web pages.

[0009] FIG. 2A is a block diagram illustrating components of the data warehouse system in one embodiment.

[0010] FIG. 2B is a block diagram illustrating details of the components of the data warehouse system in one embodiment.

[0011] FIG. 3 is a block diagram illustrating the sub-components of the data processor component in one embodiment.

[0012] FIG. 4 is a block diagram illustrating some of the tables of the local data warehouse and the main data warehouse in one embodiment.

[0013] FIG. 5 is a flow diagram illustrating the parse log data routine that implements the parser in one embodiment.

[0014] FIG. 6 is a flow diagram of the filter log entry routine in one embodiment.

[0015] FIG. 7 is a flow diagram illustrating the normalize log entry routine.

[0016] FIG. 8 is a flow diagram of the generate dimensions routine in one embodiment.

[0017] FIG. 9 is a flow diagram of the identify logical site routine in one embodiment.

[0018] FIG. 10 is a flow diagram of the identify user routine in one embodiment.

[0019] FIG. 11 is a flow diagram of the identify page type routine in one embodiment.

[0020] FIG. 12 is a flow diagram illustrating the identify events routine in one embodiment.

[0021] FIG. 13 is a flow diagram illustrating the identify sessions routine in one embodiment.

[0022] FIG. 14 is a flow diagram of the generate aggregate statistics routine in one embodiment.

[0023] FIG. 15 is a flow diagram of the import log data routine implementing the importer in one embodiment.

[0024] FIG. 16 is a flow diagram of the load dimension table routine in one embodiment.

[0025] FIG. 17 is a flow diagram of the load fact table routine in one embodiment.

[0026] FIG. 18 is a flow diagram illustrating the identify user aliases routine in one embodiment.

DETAILED DESCRIPTION

[0027] A method and system for providing customers with access to and analysis of event data (e.g., navigation data collected at customer web sites) is provided. The event data may be stored in log files and supplemented with data from other sources, such as product databases and customer invoices. In one embodiment, a data warehouse system collects customer data from the customer web sites and stores the data at a data warehouse server. The customer data may include application event data (e.g., click stream log files), user attribute data of users of the customer web site (e.g., name, age, and gender), product data (e.g., catalog of products offered for sale by the customer), shopping cart data (i.e., identification of the products currently in a user's shopping cart), and so on. The data warehouse server interacts with the customer servers to collect the customer data on a periodic basis. The data warehouse server may provide instructions to the customer servers identifying the customer data that is to be uploaded to the data warehouse server. These instructions may include the names of the files that contain the customer data and the names of the web servers on which the files reside. These instructions may also indicate the time of day when the customer data is to be uploaded to the data warehouse server. When the data warehouse server receives customer data, it converts the customer data into a format that is more conducive to processing by decision support system applications by which customers can analyze their data. For example, the data warehouse server may analyze low-level navigation events (e.g., each HTTP request that is received by the customer web site) to identify high-level events (e.g., a user session). The data warehouse server then stores the converted data into a data warehouse. The data warehouse server functions as an application service provider that provides various decision support system applications to the customers. For example, the data warehouse server provides decision support system applications to analyze and graphically display the results of the analysis for a customer. The decision support system applications may be accessed through a web browser. In one embodiment, the customer servers are connected to the data warehouse server via the Internet and the data warehouse server provides data warehousing services to multiple customers.

[0028] The data warehouse system may provide a data processor component that converts the log files into a format that is more conducive to processing by the decision support system applications. In one embodiment, the converted data is stored in a data warehouse that includes fact and dimension tables. Each fact table contains entries corresponding to a type of fact derived from the log files. For example, a web page access fact table may contain an entry for each web page access identified in the log files. Each entry may reference attributes of the web page access, such as the identity of the web page and identity of the accessing user. The values for each attribute are stored in a dimension table for that attribute. For example, a user dimension table may include an entry for each user and the entries of the web access fact table may include a user field that contains an index (or some other reference) to the entry of the user dimension table for the accessing user. The user dimension table may contain the names of the users and other user-specific information. Alternatively, the user dimension table may itself also be a fact table that includes references to dimension tables for the attributes of users. The data warehouse may also include fact tables and dimension tables that represent high-level facts and attributes derived from the low-level facts and attributes of the log files. Such high-level facts and attributes may not be derivable from only the data in a single log entry. For example, the higher level category (e.g., shoes or shirts) of a web page may be identified using a mapping of web page URIs to categories. These categories may be stored in a category dimension table. Also, certain facts, such as the collection of log entries that comprise a single user web access session or visit, may only be derivable by analyzing a series of log entries.
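The relationship between a fact table and its dimension tables may be illustrated with a minimal Python sketch. The class name, field names, user names, and URIs below are hypothetical and chosen only for illustration; the sketch is not the described implementation and simply shows fact entries holding indices into dimension tables.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionTable:
    """Stores each distinct attribute value once and hands out an index for it."""
    values: list = field(default_factory=list)
    index_of: dict = field(default_factory=dict)

    def lookup_or_add(self, value):
        # Reuse the existing entry if this value was seen before.
        if value not in self.index_of:
            self.index_of[value] = len(self.values)
            self.values.append(value)
        return self.index_of[value]


# Hypothetical dimension tables for users and web pages.
users = DimensionTable()
pages = DimensionTable()

# Hypothetical web page access fact table: one entry per access,
# each entry holding indices into the dimension tables.
page_access_facts = []


def record_access(user_name, uri):
    page_access_facts.append(
        {"user": users.lookup_or_add(user_name), "page": pages.lookup_or_add(uri)}
    )


record_access("alice", "/shoes/1.asp")
record_access("bob", "/shoes/1.asp")
record_access("alice", "/shirts/7.asp")
```

In this sketch the three accesses produce three fact entries but only two user entries and two page entries, because repeated attribute values are stored once and referenced by index.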

[0029] The data processor component may have a parser component and a loader component. The parser of the data processor parses and analyzes a log file and stores the resulting data in a local data warehouse that contains information for only that log file. The local data warehouse may be similar in structure (e.g., similar fact and dimension tables) to the main data warehouse used by decision support system applications. The local data warehouse may be adapted to allow efficient processing by the parser. For example, the local data warehouse may be stored in primary storage (e.g., main memory) for speed of access, rather than in secondary storage (e.g., disks). The parser may use parser configuration data that defines, on a customer-by-customer basis, the high-level data to be derived from the log entries. For example, the parser configuration data may specify the mapping of URIs to web page categories. The loader of the data processor transfers the data from the local data warehouse to the main data warehouse. The loader may create separate partitions for the main data warehouse. These separate partitions may hold the customer data for a certain time period (e.g., a month's worth of data). The loader adds entries to the main fact tables (i.e., fact tables of the main data warehouse) for each fact in a local fact table (i.e., fact table of the local data warehouse). The loader also adds new entries to the main dimension tables to represent attribute values of the local dimension tables that are not already in the main dimension tables. The loader also maps the local indices (or other references) of the local dimension tables to the main indices used by the main dimension tables.
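The loader's mapping of local indices to main indices may be sketched as follows. This is a minimal illustration under assumed data structures (plain Python lists and dictionaries); the function names and sample values are hypothetical and do not come from the described system.

```python
def load_dimension(local_values, main_values, main_index_of):
    """Merge a local dimension table into the main dimension table.

    Returns a list that maps each local index to its main index.
    """
    local_to_main = []
    for value in local_values:
        if value not in main_index_of:
            # Attribute value not yet in the main table: add a new entry.
            main_index_of[value] = len(main_values)
            main_values.append(value)
        local_to_main.append(main_index_of[value])
    return local_to_main


def load_facts(local_facts, local_to_main, main_facts):
    """Copy local facts into the main fact table, rewriting the user index."""
    for fact in local_facts:
        main_facts.append({**fact, "user": local_to_main[fact["user"]]})


# Hypothetical state of the main data warehouse before loading.
main_users = ["alice", "carol"]
main_user_index = {"alice": 0, "carol": 1}
main_facts = []

# Hypothetical local data warehouse produced from one log file.
local_users = ["bob", "alice"]
local_facts = [{"user": 0, "uri": "/a.asp"}, {"user": 1, "uri": "/b.asp"}]

mapping = load_dimension(local_users, main_users, main_user_index)
load_facts(local_facts, mapping, main_facts)
```

Here local index 0 ("bob") is new to the main table and receives main index 2, while local index 1 ("alice") maps to the existing main index 0.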

[0030] FIG. 2A is a block diagram illustrating components of the data warehouse system in one embodiment. The data warehouse system includes customer components that execute on the customer servers and data warehouse components that execute on the data warehouse server. The customer servers 210 and the data warehouse server 260 are interconnected via the Internet 250. Customer components executing on a customer server include a data collection component 220 and a data viewer 230. The data viewer may reside on a client computer of the customer, rather than a server. The data collection component collects the customer data from the storage devices 240 of the customer servers. The data viewer provides access for viewing of data generated by the decision support system applications of the data warehouse server. In one embodiment, the data viewer may be a web browser. The data warehouse server includes a data receiver component 270, the data processor component 280, the data warehouse 290, and decision support system applications 291. The data receiver component receives customer data sent by the data collection components executing at the various customer web sites. The data processor component processes the customer data and stores it in the data warehouse. The decision support system applications provide the customer with tools for analyzing and reviewing the customer data that is stored in the main data warehouse. Analyses performed on, and reports generated from, the data warehouse are described in U.S. patent application Ser. No. ______ (Attorney Ref. No. 34821-8010US), entitled “Identifying and Reporting on Combinations of Events in Usage Data,” and U.S. patent application Ser. No. ______ (Attorney Ref. No. 34821-8013US), entitled “Extracting and Displaying Usage Data for Graphical Structures,” which are being filed concurrently and which are hereby incorporated by reference. In one embodiment, each customer has its own set of dimension and fact tables so that the information of multiple customers is not intermingled.

[0031] FIG. 2B is a block diagram illustrating details of the components of the data warehouse system in one embodiment. The data collection component 220 includes a monitor sub-component 221 and a pitcher sub-component 222. The data collection component is described in more detail in U.S. patent application Ser. No. ______ (Attorney Ref. No. 34821-8001US), entitled “Method and System for Monitoring Resource via the Web,” which is being filed concurrently and which is hereby incorporated by reference. The pitcher is responsible for retrieving instructions from the data warehouse server, collecting the customer data in accordance with the retrieved instructions, and uploading the customer data to the data warehouse server. The monitor is responsible for monitoring the operation of the pitcher and detecting when the pitcher may have problems in collecting and uploading the customer data. When the monitor detects that a problem may occur, it notifies the data warehouse server so that corrective action may be taken in advance of the collecting and uploading of the customer data. For example, the pitcher may use certain log on information (e.g., user ID and password) to access a customer web server that contains customer data to be uploaded. The monitor may use that log on information to verify that the log on information will permit access to the customer data. Access may be denied if, for example, a customer administrator inadvertently deleted from the customer web server the user ID used by the pitcher. When the monitor provides advance notification of a problem, the problem might be corrected before the pitcher attempts to access the customer data. The monitor also periodically checks the pitcher to ensure that the pitcher is executing and, if executing, executing correctly.

[0032] The data receiver component of the data warehouse server includes a status receiver sub-component 271, a catcher sub-component 272, an FTP server 273, a status database 274, and a collected data database 275. The status receiver receives status reports from the customer servers and stores the status information in the status database. The catcher receives and processes the customer data that is uploaded from the customer web sites and stores the data in the collected data database. The data processor component includes a parser sub-component 281 and a loader sub-component 282. The parser analyzes the low-level events of the customer data, identifies high-level events, and converts the customer data into a format that facilitates processing by the decision support system applications. The loader is responsible for storing the identified high-level events in the data warehouse 290. In one embodiment, a customer may decide not to have the data collection component executing on its computer systems. In such a case, the customer server may include an FTP client 245 that is responsible for periodically transferring the customer data to the FTP server 273 of the data warehouse server. The data receiver may process this customer data at the data warehouse server in the same way as the pitcher processes the data at the customer servers. The processed data is then stored in the collected data database.

[0033] FIG. 3 is a block diagram illustrating the sub-components of the data processor component in one embodiment. The data processor component 300 includes a parser 310, data storage area 320, and a loader 330. The data processor component inputs parser configuration data 340 and a log file 350 and updates the main data warehouse 360. The parser configuration data may include a mapping of actual web sites to logical sites and a mapping of a combination of Uniform Resource Identifiers (“URIs”) and query strings of the log entries to page definitions (e.g., categories) and event definitions. The parser processes the entries of the log file to generate facts and dimensions to eventually be stored in the main data warehouse. The parser identifies events in accordance with the parser configuration data. The parser includes a filter log entry component 311, a normalize log entry component 312, a generate dimensions component 313, an identify sessions component 314, and a generate aggregate statistics component 315. The filter log entry component identifies which log entries should not be included in the main data warehouse. For example, a log entry that has an invalid format should not be included. The normalize log entry component normalizes the data in a log entry. For example, the component may convert all times to Greenwich Mean Time (“GMT”). The generate dimensions component identifies the various dimensions related to a log entry. For example, a dimension may be the Uniform Resource Identifier of the entry or the logical site identifier. The identify sessions component processes the parsed log file data stored in the local data warehouse to identify user sessions. A user session generally refers to the concept of a series of web page accesses that may be related in some way, such as by temporal proximity. The generate aggregate statistics component aggregates data for the log file being processed as each log entry is processed or after the log file is parsed. The data storage area 320 includes a local data warehouse 321. In one embodiment, the local data warehouse is stored non-persistently (or temporarily) in main memory of the computer system. The local data warehouse may contain fact tables and dimension tables that correspond generally to the tables of the main data warehouse 360. The loader retrieves the information from the local data warehouse and stores the information in the main data warehouse. The loader includes a create partitions component 331, a load dimension table component 332, and a load fact table component 333. The create partitions component creates new partitions for the main data warehouse. A partition may correspond to a collection of information within a certain time range. For example, the main data warehouse may have a partition for each month, which contains all the data for that month. The load dimension table component and the load fact table component are responsible for loading the main data warehouse with the dimensions and facts that are stored in the local data warehouse.
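One way to group web page accesses into user sessions by temporal proximity may be sketched as follows. The 30-minute inactivity threshold, the function name, and the sample data are assumptions made only for illustration; the text above does not specify how temporal proximity is measured.

```python
from datetime import datetime, timedelta

# Assumed inactivity threshold; not specified in the description above.
SESSION_GAP = timedelta(minutes=30)


def identify_sessions(accesses, gap=SESSION_GAP):
    """Group (user, timestamp) accesses into per-user sessions.

    A new session starts when a user's access follows that user's
    previous access by more than `gap`.
    """
    sessions = []
    open_session = {}  # user -> list of timestamps of the session being extended
    for user, when in sorted(accesses, key=lambda access: access[1]):
        current = open_session.get(user)
        if current is not None and when - current[-1] <= gap:
            current.append(when)
        else:
            current = [when]
            open_session[user] = current
            sessions.append((user, current))
    return sessions


# Hypothetical accesses by two users.
accesses = [
    ("u1", datetime(2000, 6, 1, 7, 0)),
    ("u2", datetime(2000, 6, 1, 7, 5)),
    ("u1", datetime(2000, 6, 1, 7, 10)),
    ("u1", datetime(2000, 6, 1, 8, 0)),
]
sessions = identify_sessions(accesses)
```

With these sample accesses, user "u1" has two sessions (the 8:00 access follows the previous one by 50 minutes) and user "u2" has one.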

[0034] In one embodiment, the log file is a web server log file of a customer. The log file may be in the “Extended Log File Format” as described in the document “http://www.w3.org/TR/WD-logfile-960323” provided by the World Wide Web Consortium, which is hereby incorporated by reference. According to that description, the log file contains lines that are either directives or entries. An entry corresponds to a single HTTP transaction (e.g., HTTP request and an HTTP response) and consists of a sequence of fields (e.g., integer, fixed, URI, date, time, and string). The meaning of the fields in an entry is specified by a field directive specified in the log file. For example, a field directive may specify that a log entry contains the fields date, time, client IP address, server IP address, and success code. Each entry in the log file would contain these five fields.
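The use of a field directive to interpret entries may be illustrated with a short Python sketch. The function name and the sample lines are hypothetical, and the sketch makes simplifying assumptions: fields are separated by whitespace, a hyphen denotes an empty field, and entries whose field count does not match the directive are skipped.

```python
def parse_extended_log(lines):
    """Yield one dict per entry, keyed by the most recent #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line, e.g. "#Fields: date time c-ip".
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # malformed entry; a fuller parser might report it
        yield {f: (None if v == "-" else v) for f, v in zip(fields, values)}


# Hypothetical log with the five fields named in the example above.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip s-ip sc-status",
    "2000-06-01 07:00:04 165.21.83.161 206.191.163.41 200",
    "2000-06-01 07:00:20 4.20.197.70 206.191.163.41 302",
]
entries = list(parse_extended_log(sample))
```

Each resulting entry is a dictionary such as `{"date": "2000-06-01", "time": "07:00:04", ...}`, so later processing can refer to fields by name regardless of the order given in the directive.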

[0035] The parser configuration data defines logical sites, page definitions, and event definitions. A logical site is a collection of one or more IP addresses and ports that should be treated as a single web site. For example, a web site may actually have five web servers with different IP addresses that handle HTTP requests for the same domain. These five IP addresses may be mapped to the same logical site to be treated as a single web site. The page definitions define the format of the URIs of log entries that are certain page types. For example, a URI with a query string of “category=shoes” may indicate a page type of “shoes.” Each event definition defines an event type and a value for that event type. For example, a log entry with a query string that includes “search=shoes” represents an event type of “search” with an event value of “shoes.” Another log entry with a query string of “add=99ABC” may represent an event type of “add” an item to the shopping cart with an event value of item number “99ABC.”
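The three kinds of definitions may be illustrated with a simplified Python sketch. The configuration structures, IP addresses, and function names below are hypothetical and much simpler than actual parser configuration data; the sketch only shows how a server address and query string could be mapped to a logical site, a page type, and events.

```python
from urllib.parse import parse_qs

# Hypothetical parser configuration; all addresses, names, and identifiers
# below are made up for illustration.
LOGICAL_SITES = {
    ("192.0.2.10", 80): 1,
    ("192.0.2.11", 80): 1,  # second server of the same web site
}
PAGE_TYPE_PARAMETER = "category"  # "category=shoes" -> page type "shoes"
EVENT_PARAMETERS = {"search": "search", "add": "add"}  # parameter -> event type


def identify_logical_site(server_ip, port):
    """Return the logical site identifier, or None if the server is unmapped."""
    return LOGICAL_SITES.get((server_ip, port))


def identify_page_type(query_string):
    """Return the page type named in the query string, or None."""
    values = parse_qs(query_string).get(PAGE_TYPE_PARAMETER)
    return values[0] if values else None


def identify_events(query_string):
    """Return (event type, event value) pairs found in the query string."""
    params = parse_qs(query_string)
    return [
        (EVENT_PARAMETERS[name], values[0])
        for name, values in params.items()
        if name in EVENT_PARAMETERS
    ]
```

For example, `identify_events("add=99ABC&color=red")` yields a single "add" event with value "99ABC", because "color" is not listed as an event parameter in this hypothetical configuration.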

[0036] FIG. 4 is a block diagram illustrating some of the tables of the local data warehouse and the main data warehouse in one embodiment. These data warehouses are databases that include fact tables and dimension tables. A fact table contains an entry for each instance of a fact (e.g., web page access). A dimension table contains an entry for each possible attribute value of an attribute (e.g., user). The entries of a fact table contain dimension fields that refer to the entries in the dimension tables for their attribute values. A table may be both a fact table and a dimension table. For example, a user dimension table with an entry for each unique user may also be a fact table that refers to attributes of the users that are stored in other dimension tables. The data warehouses contain a log entry table 401, a user table 402, a logical site table 403, a URI table 404, a referrer URI table 405, a page type table 406, event type tables 407, a query string table 408, and a referrer query string table 409. The log entry table is a fact table that contains an entry for each log entry that is not filtered out by the parser. The other tables are dimension tables for the log entry table. The user table contains an entry for each unique user identified by the parser. The logical site table contains an entry for each logical site as defined in the parser configuration data. The URI table contains an entry for each unique URI of an entry in the log entry table. The referrer URI table contains an entry for each referrer URI of the log entry table. The page type table contains an entry for each page type identified by the parser as defined in the parser configuration data. The data warehouse contains an event table for each type of event defined in the parser configuration data. Each event table contains an entry for each event value of that event type specified in an entry of the log entry table. The query string table contains an entry for each unique query string identified in an entry of the log entry table. The referrer query string table contains an entry for each unique referrer query string identified in an entry of the log entry table.

[0037] Table 1 is an example portion of a log file. The “#fields” directive specifies the meaning of the fields in the log entries. Each field in a log entry is separated by a space, and an empty field is represented by a hyphen. The #fields directive in this example indicates that each entry includes the date and time when the transaction was completed (i.e., “date” and “time”), the client IP address (i.e., “c-ip”), and so on. For example, the first log entry has a date and time of “2000-06-01 07:00:04” and a client IP address of “165.21.83.161.”

TABLE 1
#Software: Microsoft Internet Information Server 4.0
#Version: 1.0
#Date: 2000-06-01 07:00:04
#Fields: date time c-ip cs-username s-sitename s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-win32-status sc-bytes cs-bytes time-taken s-port cs-version cs(User-Agent) cs(Cookie) cs(Referrer)
2000-06-01 07:00:04 165.21.83.161 - W3SVC2 COOK_002 206.191.163.41 GET /directory/28.ASP - 200 0 148428 369 9714 80 HTTP/1.0 Mozilla/3.04+(Win95;+I) ASPSESSIONIDQQGGQGPG=JBCCFIPBBHHDANBAFFIGLGPH http://allrecipes.com/Default.asp
2000-06-01 07:00:20 4.20.197.70 - W3SVC2 COOK_002 206.191.163.41 GET /Default.asp - 302 0 408 259 30 80 HTTP/1.0 Mozilla/4.0+(compatible;+Keynote-Perspective+4.0) - -
2000-06-01 07:00:20 4.20.197.70 - W3SVC2 COOK_002 206.191.163.41 GET /Default.asp - 200 0 41245 266 200 80 HTTP/1.0 Mozilla/4.0+(compatible;+Keynote-Perspective+4.0) - -
2000-06-01 07:00:27 204.182.65.192 - W3SVC2 COOK_002 206.191.163.41 HEAD /Default.asp - 302 0 254 66 40 80 HTTP/1.0 Ipswitch_WhatsUp/3.0 - -
2000-06-01 07:00:32 24.10.69.137 - W3SVC2 COOK_002 206.191.163.41 GET /directory/541.asp - 200 0 22427 459 421 80 HTTP/1.0 Mozilla/4.7+[en]+(Win98;+U) ASPSESSIONIDQQGGQGPG=BHBCFIPBEJPNOMDPKCGLKNGC;+ARSiteUser=1%2DC2B25364%2D3775%2D11D4%2DBAC1%2D0050049BD2E4;+ARSites=ALR=1 http://allrecipes.com/directory/34.asp
2000-06-01 07:00:34 192.102.216.101 - W3SVC2 COOK_002 206.191.163.41 GET /encyc/terms/L/7276.asp - 200 0 20385 471 290 80 HTTP/1.0 Mozilla/4.7+[en]+(X11;+I;+SunOS+5.5.1+sun4u) ASPSESSIONIDQQGGQGPG=PKBCFIPBIKONBPDHKDMMEHCE http://search.allrecipes.com/gsearchresults.asp?site=allrecipes&allrecipes=allrecipes&allsites=1&q1=loin
2000-06-01 07:00:34 216.88.216.227 - W3SVC2 COOK_002 206.191.163.41 GET /default.asp - 200 0 41253 258 180 80 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+4.01;+MSN+2.5;+MSN+2.5;+Windows+98) - -
2000-06-01 07:00:36 199.203.4.10 - W3SVC2 COOK_002 206.191.163.41 GET /Default.asp - 302 0 408 485 30 80 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+98;+TUCOWS) SITESERVER=ID=22f117fb3708b2278f3c426796a78e2a -
2000-06-01 07:00:37 199.203.4.10 - W3SVC2 COOK_002 206.191.163.41 GET /Default.asp - 200 0 41277 492 421 80 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+98;+TUCOWS) SITESERVER=ID=22f117fb3708b2278f3c426796a78e2a -
2000-06-01 07:00:43 24.10.69.137 - W3SVC2 COOK_002 206.191.163.41 GET /directory/34.asp - 200 0 17835 458 320 80 HTTP/1.0 Mozilla/4.7+[en]+(Win98;+U) ASPSESSIONIDQQGGQGPG=BHBCFIPBEJPNOMDPKCGLKNGC;+ARSiteUser=1%2DC2B25364%2D3775%2D11D4%2DBAC1%2D0050049BD2E4;+ARSites=ALR=1 http://allrecipes.com/directory/25.asp
2000-06-01 07:00:47 199.203.4.10 - W3SVC2 COOK_002 206.191.163.41 GET /jumpsite.asp jumpsite=5&Go.x=16&Go.y=14 302 0 341 611 40 80 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+98;+TUCOWS) SITESERVER=ID=22f117fb3708b2278f3c426796a78e2a;+ASPSESSIONIDQQGGQGPG=FCCCFIPBKJMBDJJHBNCOEDGH http://allrecipes.com/Default.asp
2000-06-01 07:00:47 24.10.69.137 - W3SVC2 COOK_002 206.191.163.41 GET /directory/538.asp - 200 0 27471 459 881 80 HTTP/1.0 Mozilla/4.7+[en]+(Win98;+U) ASPSESSIONIDQQGGQGPG=BHBCFIPBEJPNOMDPKCGLKNGC;+ARSiteUser=1%2DC2B25364%2D3775%2D11D4%2DBAC1%2D0050049BD2E4;+ARSites=ALR=1 http://allrecipes.com/directory/34.asp
2000-06-01 07:00:47 207.136.48.117 - W3SVC2 COOK_002 206.191.163.41 GET /directory/511.asp - 200 0 77593 369 12538 80 HTTP/1.0 Mozilla/3.01Gold+(Win95;+I) ASPSESSIONIDQQGGQGPG=MFACFIPBDBNPBFPBOENJKHJN;+ARSiteUser=1%2DC2B251E5%2D3775%2D11D4%2DBAC1%2D0050049BD2E4;+ARSites=ALR=1 http://allrecipes.com/directory/506.asp
2000-06-01 07:00:49 192.102.216.101 - W3SVC2 COOK_002 206.191.163.41 GET /encyc/A1.asp ARRetSite=15&ARRefCookie=1-C2B253B8-3775-11D4-BAC1-0050049BD2E4 200 0 47193 457 260 80 HTTP/1.0 Mozilla/4.7+[en]+(X11;+SunOS+5.5.1+sun4u) ASPSESSIONIDQQGGQGPG=PKBCFIPBIKONBPDHKDMMEHCE http://porkrecipe.com/hints/tips.asp
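By way of illustration only, the reading of such a log file might be sketched as follows in Python. The sketch uses the “#Fields” directive to name the values of each entry and treats a hyphen as an empty field, as described above. The function name parse_log and the dictionary representation of an entry are assumptions made for this example and are not part of the described embodiment.

```python
def parse_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only "#Fields:" changes how entries are read.
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        values = line.split()  # fields are separated by spaces
        # A hyphen represents an empty field.
        yield {f: (None if v == "-" else v) for f, v in zip(fields, values)}


sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem cs-uri-query sc-status",
    "2000-06-01 07:00:04 165.21.83.161 GET /directory/28.ASP - 200",
]
entries = list(parse_log(sample))
```

The shortened field list in the sample is for brevity; the same function would apply to the full directive of Table 1.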

[0038] Table 2 is an example portion of parser configuration data. The logical site definitions map a server IP address, port, and root URI to a logical site. For example, the entry “LOGICALSITEURIDEFINITION=209.114.94.26,80,/,1” maps all the accesses to port 80 of IP address 209.114.94.26 at URIs with a prefix “/” to logical site 1. The page type definitions map a logical site identifier, URI pattern, and query string pattern to a page type. For example, the entry “PAGEKEYDEFINITION=news item, news item, 1, {prefix}=/homepage_include/industrynews_detail.asp, , <NewsItemID>#{Uri}” indicates that a page type of “news item” is specified for logical site 1 by a URI pattern of “/homepage_include/industrynews_detail.asp.” The definition also indicates that the event value is “<NewsItemID>#{Uri},” where the URI of the log entry is substituted for “{Uri}” and the value of NewsItemID in the query string is substituted for “<NewsItemID>.” The event type definitions map a site identifier, URI pattern, and query string pattern to an event type and value. The definitions also specify the name of the event type and the name of the dimension table for that event type. For example, the entry “EVENTDEFINITION=View News Article, View News Article, 1, {prefix}=/homepage_include/industrynews_detail.asp, <NewsItemId>=*, <NewsItemId>” indicates that View News Article event types are stored in the View News Article dimension table. That event type is indicated by a URI with “/homepage_include/industrynews_detail.asp,” and the event value is the string that follows “<NewsItemId>=” in the query string.

TABLE 2
LOGICALSITEURIDEFINITION= 209.114.94.26,80,/,1
PAGEKEYDEFINITION= news item, news item, 1, {prefix}=/homepage_include/industrynews_detail.asp, , <NewsItemId>#{Uri}
PAGEKEYDEFINITION= page, page, 1, , , {Uri}
EVENTDEFINITION= Login, Login, 1, {prefix}=/registration/login.asp, ,
EVENTDEFINITION= Logout, Logout, 1, {prefix}=/registration/logout.asp, ,
EVENTDEFINITION= Register Page 1, Register Page 1, 1, {prefix}=/registration/register.asp, ,
EVENTDEFINITION= Register Page 2, Register Page 2, 1, {prefix}=/registration/register2.asp, <UserID>=*,
EVENTDEFINITION= Registration Confirmation, Registration Confirmation, 1, {prefix}=/registration/register3.asp, ,
EVENTDEFINITION= Abort Registration, Abort Registration, 1, {prefix}=/registration/registrationabort.asp, ,
EVENTDEFINITION= Member Services, Member Services, 1, {prefix}=/registration/memberservices.asp, ,
EVENTDEFINITION= Change Password, Change Password, 1, {prefix}=/registration/changepassword.asp, ,
EVENTDEFINITION= Profile Edit, Profile Edit, 1, {prefix}=/registration/profile.asp, ,
EVENTDEFINITION= Change Affiliation, Change Affiliation, 1, {prefix}=/registration/changeaffiliation.asp, <UserID>=*,
EVENTDEFINITION= Change Secret Question, Change Secret Question, 1, {prefix}=/registration/changesecretquestion.asp, ,
EVENTDEFINITION= Forgot Information, Forgot Information, 1, {prefix}=/registration/forgotinfo.asp, ,
EVENTDEFINITION= Forgot Password, Forgot Password, 1, {prefix}=/registration/forgotpassword.asp, ,
EVENTDEFINITION= Forgot Signin, Forgot Signin, 1, {prefix}=/registration/forgotsignin.asp, ,
EVENTDEFINITION= View News Article, View News Article, 1, {prefix}=/homepage_include/industrynews_detail.asp, <NewsItemId>=*, <NewsItemId>
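As a hedged illustration, an event definition of the kind shown in Table 2 might be interpreted as sketched below. The sketch assumes that each definition has six comma-separated fields (event name, dimension table name, logical site, URI pattern, query string pattern, and event value), that “{prefix}=” denotes a URI prefix match, and that “<Name>=*” requires the named query parameter to be present with any value. These readings, the function names, and the case-insensitive URI comparison are assumptions for this example only.

```python
from urllib.parse import parse_qs


def parse_event_definition(line):
    """Split 'EVENTDEFINITION= name, table, site, uri, query, value' into a dict."""
    keyword, _, rest = line.partition("=")
    if keyword.strip() != "EVENTDEFINITION":
        raise ValueError("not an event definition")
    name, table, site, uri, query, value = [p.strip() for p in rest.split(",")]
    prefix_tag = "{prefix}="
    return {
        "name": name,
        "table": table,
        "site": int(site),
        "uri_prefix": uri[len(prefix_tag):] if uri.startswith(prefix_tag) else uri,
        "query_param": query[1:-3] if query else None,  # "<X>=*" -> "X"
        "value_param": value[1:-1] if value else None,  # "<X>"   -> "X"
    }


def match_event(defn, site, uri_stem, uri_query):
    """Return (matched, event_value) for the site, URI, and query string of a log entry."""
    if site != defn["site"]:
        return False, None
    if not uri_stem.lower().startswith(defn["uri_prefix"].lower()):
        return False, None
    params = parse_qs(uri_query or "")
    if defn["query_param"] and defn["query_param"] not in params:
        return False, None
    value = params.get(defn["value_param"], [None])[0] if defn["value_param"] else None
    return True, value


view_news = parse_event_definition(
    "EVENTDEFINITION= View News Article, View News Article, 1, "
    "{prefix}=/homepage_include/industrynews_detail.asp, <NewsItemId>=*, <NewsItemId>"
)
```

With this sketch, a request for “/homepage_include/industrynews_detail.asp” with a query string of “NewsItemId=42” at logical site 1 would match the definition with an event value of “42.”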

[0039] FIGS. 5-14 are flow diagrams of components of the parser in one embodiment. FIG. 5 is a flow diagram illustrating the parse log data routine that implements the main routine of the parser in one embodiment. The routine processes each entry in the log file based on the parser configuration data. The routine filters out certain log entries, normalizes the attribute values of the log entries, and generates entries in the dimension tables for the attributes of the log entries. After processing all the log entries, the parser identifies user sessions and generates various statistics. In blocks 501-508, the routine loops selecting and processing each log entry. In block 501, the routine selects the next log entry of the log file starting with the first log entry. The routine may also pre-process the header information of the log file to identify the fields of the log entries. In decision block 502, if all the log entries have already been selected, then the routine continues at block 509, else the routine continues at block 503. In block 503, the routine extracts the values for the fields of the selected log entry. In block 504, the routine invokes the filter log entry routine, which returns an indication as to whether the selected log entry should be filtered out. In decision block 505, if the filter log entry routine indicates that the selected log entry should be filtered out, then the routine skips to block 508, else the routine continues at block 506. In block 506, the routine invokes the normalize log entry routine to normalize the values of the fields of the selected log entry. In block 507, the routine invokes the generate dimensions routine to update the dimension tables based on the selected log entry and to add an entry into the log entry fact table. In block 508, the routine updates the statistics for the log file. For example, the routine may track the number of log entries that have been filtered out. The routine then loops to block 501 to select the next log entry. In block 509, the routine outputs the log file statistics. In block 510, the routine invokes the identify sessions routine that scans the log entry table to identify the user sessions and updates a session dimension table. In block 511, the routine invokes the generate aggregate statistics routine to generate various statistics and then completes.
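The loop of blocks 501-509 might be sketched as follows. This is an illustrative outline only: the filter, normalize, and generate dimensions routines are passed in as callables, and the function name and the shape of the statistics are assumptions rather than details of the described embodiment. Session identification and aggregate statistics (blocks 510-511) are omitted.

```python
def parse_log_data(entries, should_filter, normalize, generate_dimensions):
    """Process each log entry and return simple log file statistics."""
    stats = {"processed": 0, "filtered": 0}
    for entry in entries:                      # blocks 501-503: select next entry
        if should_filter(entry):               # blocks 504-505: filter check
            stats["filtered"] += 1             # block 508: update statistics
            continue
        generate_dimensions(normalize(entry))  # blocks 506-507
        stats["processed"] += 1                # block 508: update statistics
    return stats                               # block 509: output statistics


facts = []
stats = parse_log_data(
    [{"sc-status": 200}, {"sc-status": 404}],
    should_filter=lambda e: e["sc-status"] != 200,
    normalize=lambda e: e,
    generate_dimensions=facts.append,
)
```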

[0040] FIG. 6 is a flow diagram of the filter log entry routine in one embodiment. The filter log entry routine is passed a log entry and determines whether the log entry should be filtered out. In blocks 601-607, the routine determines whether the filter out conditions have been satisfied. In decision block 601, the routine determines whether the log entry has a field count problem. A field count problem arises when the number of fields in the log entry does not correspond to the number of expected fields for that log entry. The number and types of fields may be defined in a “fields” directive line of the log file. In decision block 602, the routine determines whether the log entry is outside of a specified time range. The routine compares the time field of the log entry to the time range. The time range may be specified so that only those log entries within that time range are processed. In decision block 603, the routine determines whether the IP address of the log entry should be ignored. For example, a log entry may be ignored if the entry originated from a server whose function is to ping the customer's web server at periodic intervals. In decision block 604, the routine determines whether the log entry corresponds to a comment (e.g., a “#remarks” directive). In decision block 605, the routine determines whether the success code associated with the log entry indicates that the log entry should be ignored. For example, if the success code indicates a failure, then the log entry may be ignored. In decision block 606, the routine determines whether the log entry is requesting a resource whose extension indicates that the log entry should be ignored. For example, the routine may ignore log entries requesting graphic files, such as those in the “.gif” format. In decision block 607, the routine determines whether the values within the fields of the log entry are corrupt. For example, a value in the date field that indicates a date of February 30th is corrupt. One skilled in the art would appreciate that the various filtering conditions may be specified in a configuration file. For example, the time range, IP addresses, and so on may be specified in the configuration file. These configuration files may be specified on a customer-by-customer basis.
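A minimal sketch of such filter conditions is given below, assuming that a log entry is represented as a dictionary keyed by field name and that comment lines have already been skipped when the file was read. The ignored IP address, the list of extensions, and the rule that a status of 400 or above counts as a failure are hypothetical configuration values chosen for the example. The checks are also applied in a slightly different order than blocks 601-607.

```python
from datetime import datetime

IGNORED_IPS = {"204.182.65.192"}       # hypothetical monitoring host
IGNORED_EXTENSIONS = (".gif", ".jpg")  # hypothetical list of ignored extensions


def should_filter(entry, expected_fields, start, end):
    """Return True when the log entry should be filtered out."""
    if len(entry) != expected_fields:              # field count problem
        return True
    try:
        when = datetime.strptime(
            f"{entry['date']} {entry['time']}", "%Y-%m-%d %H:%M:%S"
        )
        status = int(entry["sc-status"])
    except (KeyError, TypeError, ValueError):      # corrupt values, e.g. February 30th
        return True
    if not (start <= when <= end):                 # outside the time range
        return True
    if entry.get("c-ip") in IGNORED_IPS:           # ignored IP address
        return True
    if status >= 400:                              # assumed meaning of "failure"
        return True
    stem = entry.get("cs-uri-stem") or ""
    return stem.lower().endswith(IGNORED_EXTENSIONS)  # ignored extension


base = {"date": "2000-06-01", "time": "07:00:04", "c-ip": "165.21.83.161",
        "sc-status": "200", "cs-uri-stem": "/directory/28.ASP"}
window = (datetime(2000, 6, 1), datetime(2000, 6, 2))
```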

[0041] FIG. 7 is a flow diagram illustrating the normalize log entry routine. The routine normalizes the values of the fields in the passed log entry. In block 701, the routine converts the time of the log entry into a standard time such as Greenwich Mean Time. In block 702, the routine corrects the time based on the variation between the times of the customer web servers. For example, the time of one web server may be five minutes ahead of the time of another web server. This correction may be based on current time information collected from computer systems that generated the events and then correlated to base current time information. In block 703, the routine normalizes the values of the fields of the log entry. This normalization may include processing search strings to place them in a canonical form. For example, a search string of “back pack” may have a canonical form of “backpack.” Other normalization of search strings may include stemming of words (e.g., changing “clothes” and “clothing” to “cloth”), synonym matching, and first and last word grouping. The first word grouping for the search strings of “winter clothing” and “winter shoes” results in the string of “winter.”
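The time correction and two of the search string normalizations might be sketched as follows. The function names, the representation of a server's clock variation as a fixed offset, and the lowercasing of search strings are assumptions for illustration; stemming and synonym matching are not shown.

```python
from datetime import datetime, timedelta


def correct_time(local_time, utc_offset_hours, server_skew):
    """Convert a server-local time to GMT, then remove that server's clock skew.

    server_skew is positive when the server's clock runs ahead of the base time.
    """
    return local_time - timedelta(hours=utc_offset_hours) - server_skew


def remove_spaces(search_string):
    """Canonical form by removal of spaces, e.g. 'back pack' -> 'backpack'."""
    return "".join(search_string.lower().split())


def first_word_group(search_string):
    """First word grouping, e.g. 'winter clothing' -> 'winter'."""
    words = search_string.lower().split()
    return words[0] if words else ""


# A server seven hours behind GMT whose clock runs five minutes fast.
corrected = correct_time(datetime(2000, 6, 1, 0, 5, 0), -7, timedelta(minutes=5))
```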

[0042] FIG. 8 is a flow diagram of the generate dimensions routine in one embodiment. This routine identifies a value for each dimension associated with the passed log entry and ensures that the dimension tables contain entries corresponding to those values. In one embodiment, each entry in a dimension table includes the attribute value (e.g., user identifier) and a hash value. The hash value may be used by the loader when transferring information to the main data warehouse. Also, each entry has a local identifier, which may be an index into the local dimension table. The loader maps these local identifiers to their corresponding main identifiers that are used in the main data warehouse. In block 801, the routine invokes a routine that identifies the logical site associated with the log entry and ensures that an entry for the logical site is in the logical site dimension table. In block 802, the routine invokes a routine that identifies the user associated with the log entry and ensures that an entry for the user is in the user dimension table. In block 803, the routine invokes a routine that identifies the URI associated with the log entry and ensures that an entry for that URI is in the URI dimension table. In block 804, the routine invokes a routine that identifies the page type based on the parser configuration data and ensures that an entry for that page type is in the page type dimension table. In block 805, the routine invokes a routine that identifies the various events associated with the log entry based on the parser configuration data and ensures that an entry for each event type is in the corresponding event table. In block 806, the routine identifies other dimensions (e.g., referrer URI) as appropriate. In block 807, the routine adds an entry to the log entry table that is linked to each of the identified dimensions using the local identifiers. In block 808, the routine updates the statistics information based on the log entry and then returns.
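A local dimension table of this kind might be sketched as below, with the local identifier being the index of the entry and a hash value stored alongside the attribute value. The class name and the choice of SHA-256 as the hash function are assumptions for the example; the description does not specify a hash function.

```python
import hashlib


class LocalDimensionTable:
    """Dimension table whose local identifier is an index into the table."""

    def __init__(self):
        self.entries = []  # entries[local_id] = {"value": ..., "hash": ...}
        self._index = {}   # attribute value -> local identifier

    def ensure(self, value):
        """Return the local identifier for value, adding an entry if needed."""
        if value not in self._index:
            digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
            self._index[value] = len(self.entries)
            self.entries.append({"value": value, "hash": digest})
        return self._index[value]


uris = LocalDimensionTable()
first = uris.ensure("/Default.asp")
second = uris.ensure("/directory/34.asp")
again = uris.ensure("/Default.asp")  # existing entry, so no new row is added

# A log entry fact row links to dimensions by local identifier (block 807).
fact_row = {"uri": again}
```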

[0043] FIG. 9 is a flow diagram of the identify logical site routine in one embodiment. This routine compares the site information of the passed log entry with the logical site definitions in the parser configuration data. In block 901, the routine selects the next logical site definition from the parser configuration data. In decision block 902, if all the logical site definitions have already been selected, then the routine continues at block 905, else the routine continues at block 903. In decision block 903, if the URI of the log entry matches the selected logical site definition, then the routine continues at block 904, else the routine loops to block 901 to select the next logical site definition. In block 904, the routine updates the logical site dimension table to ensure that it contains an entry for the logical site defined by the selected logical site definition. The routine then returns. In block 905, the routine updates the logical site dimension table to ensure that it contains a default logical site definition and then returns. The log entries that do not map to a logical site definition are mapped to a default logical site.
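The matching of blocks 901-905 might be sketched as follows, assuming each logical site definition is a tuple of server IP address, port, root URI, and logical site identifier as in Table 2. The identifier 0 for the default logical site is a hypothetical value.

```python
DEFAULT_SITE = 0  # hypothetical identifier for the default logical site


def identify_logical_site(definitions, server_ip, port, uri):
    """Return the logical site of the first matching definition, else the default."""
    for def_ip, def_port, root_uri, site_id in definitions:  # blocks 901-902
        if server_ip == def_ip and port == def_port and uri.startswith(root_uri):
            return site_id                                   # blocks 903-904
    return DEFAULT_SITE                                      # block 905


site_definitions = [("209.114.94.26", 80, "/", 1)]
```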

[0044] FIG. 10 is a flow diagram of the identify user routine in one embodiment. This routine may use various techniques to identify the user associated with the passed log entry. In one embodiment, the selection of the technique is configured based on the customer web site. For example, one customer may specify to use a cookie to identify users. In the absence of a user identifier in the cookie, the industry norm is to identify users based on their IP addresses. This routine illustrates a technique in which a combination of cookies and IP addresses is used to identify a user. In block 1001, the routine extracts the user identifier from the cookie associated with the log entry. The format of a cookie may be specified on a customer-by-customer basis. In decision block 1002, if the extraction from the cookie was successful, then the routine continues at block 1006, else the routine continues at block 1003. The extraction may not be successful if, for example, the log entry did not include a cookie. In block 1003, the routine extracts the IP address from the log entry. In decision block 1004, if the IP address is determined to be unique, then the routine continues at block 1006, else the routine continues at block 1005. Certain IP addresses may not be unique. For example, an Internet service provider may use one IP address for many of its users. The Internet service provider performs the mapping of the one IP address to the various users. In block 1005, the routine extracts the browser identifier from the log entry. The combination of IP address and browser identifier may uniquely identify a user. In block 1006, the routine updates the user dimension table to ensure that it has an entry for this user and then returns.
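One possible sketch of blocks 1001-1005 follows. It assumes the cookie field uses the “;+” separator seen in Table 1, that the customer has configured which cookie carries the user identifier, and that a set of known shared (non-unique) IP addresses is available; the function name, the prefixes on the returned identifiers, and the sample shared address are hypothetical.

```python
def identify_user(entry, cookie_name, shared_ips):
    """Return a user key from the cookie, else the IP address, else IP plus browser."""
    cookie = entry.get("cs(Cookie)") or ""
    for part in cookie.split(";+"):                  # block 1001
        name, _, value = part.partition("=")
        if name == cookie_name and value:            # block 1002: extraction succeeded
            return "cookie:" + value
    ip = entry["c-ip"]                               # block 1003
    if ip not in shared_ips:                         # block 1004: IP address is unique
        return "ip:" + ip
    agent = entry.get("cs(User-Agent)") or ""        # block 1005
    return "ip+agent:" + ip + "|" + agent


with_cookie = {
    "c-ip": "24.10.69.137",
    "cs(User-Agent)": "Mozilla/4.7+[en]+(Win98;+U)",
    "cs(Cookie)": "ASPSESSIONIDQQGGQGPG=BHBCFIPBEJPNOMDPKCGLKNGC;+ARSiteUser=1%2DC2B25364;+ARSites=ALR=1",
}
without_cookie = {"c-ip": "4.20.197.70", "cs(User-Agent)": "Mozilla/4.0", "cs(Cookie)": None}
```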

[0045] FIG. 11 is a flow diagram of the identify page type routine in one embodiment. This routine uses the page type definitions of the parser configuration data to identify the page type associated with the log entry. In block 1101, the routine selects the next page type definition from the parser configuration data. In decision block 1102, if all the page type definitions have already been selected, then no matching page type has been found and the routine returns, else the routine continues at block 1103. In decision block 1103, if the log entry matches the selected page type definition, then the routine continues at block 1104, else the routine loops to block 1101 to select the next page type definition. In block 1104, the routine updates the page type dimension table to ensure that it contains an entry for the page type represented by the selected page type definition. The routine then returns.

[0046] FIG. 12 is a flow diagram illustrating the identify events routine in one embodiment. This routine determines whether the log entry corresponds to any of the events specified in the parser configuration data. In block 1201, the routine selects the next type of event from the parser configuration data. In decision block 1202, if all the event types have already been selected, then the routine returns, else the routine continues at block 1203. In block 1203, the routine selects the next event definition of the selected event type. In decision block 1204, if all the event definitions of the selected event type have already been selected, then the log entry does not correspond to this type of event and the routine loops to block 1201 to select the next type of event, else the routine continues at block 1205. In decision block 1205, if the log entry matches the selected event definition, then the routine continues at block 1206, else the routine loops to block 1203 to select the next event definition of the selected event type. In block 1206, the routine updates the dimension table for the selected type of event to ensure that it contains an entry for the selected event definition. The routine then loops to block 1201 to select the next type of event. In this way, the routine matches no more than one event definition for a given event type. For example, if there are two event definitions for the event type “Keyword Search” and the first one processed matches, then the second one is ignored.
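The nested loop of blocks 1201-1206, including the rule that at most one event definition is matched per event type, might be sketched as follows. The grouping of definitions in a dictionary keyed by event type, the use of URI prefixes as definitions, and the sample URIs are assumptions made for the example.

```python
def identify_events(definitions_by_type, entry, matches):
    """Return, for each event type, the first event definition the entry matches."""
    found = {}
    for event_type, definitions in definitions_by_type.items():  # blocks 1201-1202
        for definition in definitions:                           # blocks 1203-1204
            if matches(definition, entry):                       # block 1205
                found[event_type] = definition                   # block 1206
                break  # later definitions of this event type are ignored
    return found


definitions = {
    "Login": ["/registration/login.asp"],
    "Keyword Search": ["/search", "/search/advanced"],  # hypothetical URI prefixes
}
prefix_match = lambda definition, entry: entry["cs-uri-stem"].startswith(definition)
found = identify_events(definitions, {"cs-uri-stem": "/search/advanced"}, prefix_match)
```

Both “Keyword Search” definitions match the sample entry, but only the first one processed is recorded.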

[0047] FIG. 13 is a flow diagram illustrating the identify sessions routine in one embodiment. This routine scans the log entry table of the local data warehouse to identify user sessions. In one embodiment, a user session may be delimited by a certain period of inactivity (e.g., thirty minutes). The criteria for identifying a session may be configurable on a customer-by-customer basis. In block 1301, the routine selects the next user from the user dimension table. In decision block 1302, if all the users have already been selected, then the routine returns, else the routine continues at block 1303. In block 1303, the routine selects the next log entry for the selected user in time order. In decision block 1304, if all log entries for the selected user have already been selected, then the routine loops to block 1301 to select the next user, else the routine continues at block 1305. In decision block 1305, if the selected log entry indicates that a new session is starting (e.g., its time is more than 30 minutes greater than that of the last log entry processed), then the routine continues at block 1306, else the routine loops to block 1303 to select the next log entry for the selected user. In block 1306, the routine updates a session fact table to add an indication of the new session. The routine then loops to block 1303 to select the next log entry for the selected user. The routine may also update the log entries to reference their sessions.
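Session identification by an inactivity period might be sketched as below. The sketch assumes log entries reduced to pairs of user identifier and time, treats the first entry of each user as starting a session, and records each session as a tuple of user, start time, and end time; these representations are assumptions for illustration.

```python
from datetime import datetime, timedelta


def identify_sessions(entries, timeout=timedelta(minutes=30)):
    """entries: iterable of (user_id, time). Returns a list of (user, start, end)."""
    sessions = []
    open_session = {}  # user -> index of that user's most recent session
    for user, when in sorted(entries):  # per user, in time order
        idx = open_session.get(user)
        if idx is None or when - sessions[idx][2] > timeout:
            sessions.append([user, when, when])  # a new session is starting
            open_session[user] = len(sessions) - 1
        else:
            sessions[idx][2] = when              # extend the current session
    return [tuple(s) for s in sessions]


t = lambda hour, minute: datetime(2000, 6, 1, hour, minute)
sessions = identify_sessions([("a", t(7, 0)), ("b", t(7, 5)), ("a", t(7, 10)), ("a", t(8, 0))])
```

In the sample, user “a” is inactive for fifty minutes after 07:10, so the 08:00 entry starts a second session.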

[0048] FIG. 14 is a flow diagram of the generate aggregate statistics routine in one embodiment. This routine generates statistics based on analysis of the fact and dimension tables used by the parser. In block 1401, the routine selects the next fact table of interest. In decision block 1402, if all the fact tables have already been selected, then the routine returns, else the routine continues at block 1403. In block 1403, the routine selects the next entry of the selected fact table. In decision block 1404, if all the entries of the selected fact table have already been selected, then the routine loops to block 1401 to select the next fact table, else the routine continues at block 1405. In block 1405, the routine aggregates various statistics about the selected fact table. The routine then loops to block 1403 to select the next entry of the fact table.

[0049] FIGS. 15-17 are flow diagrams illustrating components of the loader in one embodiment. FIG. 15 is a flow diagram of the load log data routine implementing the main routine of the loader in one embodiment. This routine controls the moving of the data from the local data warehouse (created and used by the parser) into the main data warehouse. In block 1501, the routine invokes the create partitions routine to create partitions for the main data warehouse as appropriate. In blocks 1502-1504, the routine loops loading the dimension tables into the main data warehouse. In block 1502, the routine selects the next dimension table. In decision block 1503, if all the dimension tables have already been selected, then the routine continues at block 1505, else the routine continues at block 1504. In block 1504, the routine invokes the load dimension table routine for the selected dimension table. The routine then loops to block 1502 to select the next dimension table. In blocks 1505-1507, the routine loops adding the entries to the fact tables of the main data warehouse. In block 1505, the routine selects the next fact table in order. The order in which the fact tables are to be loaded may be specified by configuration information. The fact tables may be loaded in order based on their various dependencies. For example, a log entry fact table may be dependent on a user dimension table that is itself a fact table. In decision block 1506, if all the fact tables have already been loaded, then the routine returns, else the routine continues at block 1507. In block 1507, the routine invokes the load fact table routine for the selected fact table. The routine then loops to block 1505 to select the next fact table.

[0050] FIG. 16 is a flow diagram of the load dimension table routine in one embodiment. This routine maps the local identifiers used in the local data warehouse to the main identifiers used in the main data warehouse. In block 1601, the routine selects the next entry from the dimension table. In decision block 1602, if all the entries of the dimension table have already been selected, then the routine returns, else the routine continues at block 1603. In block 1603, the routine retrieves an entry from the dimension table of the main data warehouse corresponding to the selected entry. In decision block 1604, if the entry is retrieved, then the routine continues at block 1606, else the dimension table does not contain an entry and the routine continues at block 1605. In block 1605, the routine adds an entry to the dimension table of the main data warehouse corresponding to the selected entry from the dimension table of the local data warehouse. In block 1606, the routine creates a mapping of the local identifier (e.g., index into the local dimension table) of the selected entry to the main identifier (e.g., index into the main dimension table) for that selected entry. The routine then loops to block 1601 to select the next entry of the dimension table.

[0051] FIG. 17 is a flow diagram of the load fact table routine in one embodiment. This routine adds the facts of the local data warehouse to the main data warehouse. The routine maps the local identifiers for the dimensions used in the local warehouse to the main identifiers of dimensions used in the main data warehouse. In block 1701, the routine selects the next entry in the fact table. In decision block 1702, if all the entries of the fact table have already been selected, then the routine returns, else the routine continues at block 1703. In block 1703, the routine selects the next dimension for the selected entry. In decision block 1704, if all the dimensions for the selected entry have already been selected, then the routine continues at block 1706, else the routine continues at block 1705. In block 1705, the routine retrieves the main identifier for the selected dimension and then loops to block 1703 to select the next dimension. In block 1706, the routine stores an entry in the fact table of the main data warehouse. The routine then loops to block 1701 to select the next entry in the fact table.
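The loading of a dimension table and of a fact table that references it might be sketched together as follows. The main dimension table is modeled as a dictionary from attribute value to main identifier, and new main identifiers are assigned as one more than the current table size; both choices, and the function names, are assumptions for the example and presume that existing main identifiers are numbered densely from 1.

```python
def load_dimension(local_values, main_table):
    """Return a mapping of local identifier -> main identifier (FIG. 16)."""
    mapping = {}
    for local_id, value in enumerate(local_values):  # blocks 1601-1602
        if value not in main_table:                  # blocks 1603-1604
            main_table[value] = len(main_table) + 1  # block 1605: add a main entry
        mapping[local_id] = main_table[value]        # block 1606
    return mapping


def load_facts(local_facts, mappings, main_facts):
    """Store each fact with local identifiers replaced by main identifiers (FIG. 17)."""
    for fact in local_facts:                         # blocks 1701-1702
        main_facts.append(                           # block 1706
            {dim: mappings[dim][local_id] for dim, local_id in fact.items()}  # blocks 1703-1705
        )


main_users = {"alice": 1, "bob": 2}
user_map = load_dimension(["bob", "carol"], main_users)
main_facts = []
load_facts([{"user": 1}, {"user": 0}], {"user": user_map}, main_facts)
```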

[0052] FIG. 18 is a flow diagram illustrating the identify user aliases routine in one embodiment. This routine tracks the different user identifiers as a user switches from one web site to another. In particular, the routine maps the user identifiers used by a referrer web site to the user identifiers used by the referred-to web site. In this way, the same user can be tracked even though different web sites use different identifiers for that user. This routine may be invoked as part of the parsing of the log files. In decision block 1801, if the log entry indicates a referrer web site, then the routine continues at block 1802, else the routine returns. In block 1802, the routine identifies the user identifier for the referrer web site. In block 1803, the routine creates a mapping between the referrer user identifier and the referred-to user identifier. The routine then returns.
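A hedged sketch of blocks 1801-1803 follows. It assumes that the referrer web site passes its user identifier in a query string parameter, here taken to be “ARRefCookie” by analogy with the last entry of Table 1; the description does not state how the referrer user identifier is conveyed, so the parameter name, the function name, and the sample referred-to identifier are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def identify_user_alias(entry, local_user_id, aliases, ref_param="ARRefCookie"):
    """Record (referrer host, referrer user id) -> referred-to user id, if available."""
    referrer = entry.get("cs(Referrer)")
    if not referrer:                                   # block 1801: no referrer
        return
    ref_ids = parse_qs(entry.get("cs-uri-query") or "").get(ref_param)
    if ref_ids:                                        # block 1802
        key = (urlsplit(referrer).hostname, ref_ids[0])
        aliases[key] = local_user_id                   # block 1803


aliases = {}
identify_user_alias(
    {
        "cs-uri-query": "ARRetSite=15&ARRefCookie=1-C2B253B8-3775-11D4-BAC1-0050049BD2E4",
        "cs(Referrer)": "http://porkrecipe.com/hints/tips.asp",
    },
    "user-17",  # hypothetical identifier used by the referred-to web site
    aliases,
)
```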

[0053] From the above description it will be appreciated that although specific embodiments of the technology have been described for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. For example, the processing of the parser may be performed by the data collection component before sending the data to the data warehouse server. Accordingly, the invention is not limited except by the appended claims.

1. A method of processing data before updating a database based on the processed data, the database having a main table with a main identifier for each entry in the main table, the method comprising: identifying an entry that should be in the main table; generating a local identifier for the entry; adding the entry with the local identifier to a local table; generating information to be stored in the database that references the entry in the local table using the local identifier; and after generating the information, generating a main identifier for the entry; adding an entry with the main identifier to the main table; and storing the generated information in the database with the local identifier replaced with the main identifier.
2. The method of claim 1 wherein the local table is stored in main memory.
3. The method of claim 1 wherein the table is a dimension table of the database.
4. The method of claim 1 wherein the generated information is stored in a fact table of the database.
5. The method of claim 1 wherein the local identifier is generated based on key information.
6. The method of claim 5 including generating a hash value based on the key information and storing the hash value in the entry of the local table.
7. The method of claim 6 including using the stored hash value to locate entries in the main table.
8. The method of claim 1 wherein the processed data relates to navigation information of a web site.
9. The method of claim 1 wherein the processed data is click stream data.
10. A method in a computer system for parsing information before updating data in a main database, the main database having fact tables and dimension tables, the method comprising: creating a fact table and a dimension table corresponding to a fact table and dimension table of the main database; identifying from the information entries for the created fact table and dimension table; storing the identified entries in the created fact table and dimension table; and analyzing the entries stored in the created fact table and dimension table.
11. The method of claim 10 wherein the parsed information relates to user interactions with web pages.
12. The method of claim 10 including updating data in the main database based on the entries in the created fact table and dimension table.
13. The method of claim 10 wherein the created fact table and dimension table are stored in main memory.
14. The method of claim 10 wherein the created fact and dimension table are stored in non-persistent memory.
15. The method of claim 10 wherein the created fact and dimension table are stored in temporary memory.
16. The method of claim 10 wherein the created fact and dimension table are destroyed after information in the main database is updated based on the created fact and dimension table.
17. The method of claim 10 wherein the data in the main database is updated based on an ordering of fact and dimension tables.
18. A method in a computer system for identifying navigation paths through web pages based on user navigation information, the method comprising: analyzing the user navigation information to identify entries associated with the same user; and for each user, identifying the web pages that are accessed by the user; and storing an indication of the identified web pages as a navigation path in a persistent database.
19. The method of claim 18 wherein a sequence of identified web pages for a user is designated as a session.
20. The method of claim 19 wherein a session is delimited by time between access of web pages by the user.
 21. A method in a computer for processing navigationinformation for web pages, the method comprising: selecting an entry ofthe navigation information; identifying a uniform resource identifier ofthe selected entry; and when the identified uniform resource identifiersatisfies a match criterion, storing in a persistent database anindication that the entry matches the criterion.
 22. The method of claim21 wherein criterion indicates that the entry is for a web pages of acertain category.
 23. The method of claim 21 wherein the criterionidentifies an event.
 24. The method of claim 21 wherein the criterionspecifies a type of web page.
25. A method in a computer system for identifying a user who accesses a web page, the method comprising: providing an indication of a request for a second web page, the request including second information identifying a user who requested the second web page and first information identifying a user who requested a first web page that included a reference to the second web page; and indicating the first information and the second information identify the same user.
26. The method of claim 25 wherein the first information is included as information of a referrer.
27. The method of claim 25 including storing a mapping of the first information to the second information.
28. The method of claim 27 including checking the mapping to determine whether user information corresponds to a user identified with other information.
29. The method of claim 25 wherein the first information and second information are provided by different web domains.
30. A method in a computer system for identifying high-level events from low-level events, the method comprising: providing a plurality of event definitions that map low-level events to high-level events; and for each low-level event, determining whether the low-level event matches a provided event definition; and when a low-level event matches an event definition, persistently storing an indication of the high-level event associated with the matching event definition.
31. The method of claim 30 wherein the low-level events are navigation events.
32. The method of claim 30 wherein the low-level events are derived from click stream information.
33. The method of claim 30 wherein the indication of the high-level event is persistently stored in a data warehouse.
34. A method in a computer system for processing click stream data, the method comprising: receiving time synchronization information for the click stream data; adjusting times associated with the click stream data based on the received time synchronization information; and persistently storing the adjusted times.
35. The method of claim 34 wherein the time synchronization information is based on current time information received from a web server associated with the click stream data.
36. The method of claim 35 including sending a request to the web server for current time information.
37. The method of claim 34 wherein the time synchronization information is based on current time information of a web server and base time information.
38. A computer-readable medium containing instructions for controlling a computer system to parse information before updating data in a main database, the main database having fact tables and dimension tables, by a method comprising: creating fact tables and dimension tables corresponding to fact tables and dimension tables of the main database; identifying from the information entries for the created fact tables and dimension tables; storing the identified entries in the created fact tables and dimension tables; and analyzing the entries stored in the created fact tables and dimension tables.
39. The computer-readable medium of claim 38 wherein the parsed information relates to user interactions with web pages.
40. The computer-readable medium of claim 38 including updating data in the main database based on the entries in the created fact tables and dimension tables.
41. The computer-readable medium of claim 38 wherein the created fact tables and dimension tables are stored in main memory.
42. The computer-readable medium of claim 38 wherein the created fact tables and dimension tables are stored in non-persistent memory.
43. The computer-readable medium of claim 38 wherein the created fact and dimension tables are stored in temporary memory.
44. The computer-readable medium of claim 38 wherein the created fact and dimension tables are destroyed after information in the main database is updated based on the created fact and dimension tables.
45. The computer-readable medium of claim 38 wherein the data in the main database is updated based on an ordering of fact and dimension tables.
46. A method in a computer system for processing of search strings, the method comprising: identifying a search string in a log file; identifying a canonical form of the search string; and storing the identified canonical form so that subsequent processing of the search string uses the canonical form.
47. The method of claim 46 wherein the identifying of a canonical form includes stemming of a word in the search string.
48. The method of claim 46 wherein the identifying of a canonical form includes first or last word grouping.
49. The method of claim 46 wherein the identifying of a canonical form includes removal of spaces between words.